Abstract. In this paper we describe a data structure that supports pattern
matching queries on a dynamically arriving text T over an alphabet of
constant size. Each new symbol can be prepended to T in O(1) expected
worst-case time. At any moment, we can report all occurrences of a pattern
P in the current text in O(|P| + k) time, where |P| is the length of
P and k is the number of occurrences. Under the assumption of a
constant-size alphabet, this resolves a long-standing open problem: the
existence of a real-time indexing method for string matching (see [2]).

1 Introduction 

Two main versions of the string matching problem differ in which of the 
two components - pattern P or text T - is provided first in the input (or 
is considered as fixed) and can then be preprocessed before processing the 
other component. The framework when the text has to be preprocessed is 
usually called indexing, as it can be viewed as constructing a text index 
supporting matching queries. 

Real-time variants of the string matching problem are about as old
as string matching itself. In the 1970s, the existence of real-time string
matching algorithms was first studied for Turing machines. For example,
it has been shown that the language {P#T}, where P occurs in T, can
be recognized by a Turing machine, while the language {T#P} cannot
[7]. In the realm of the RAM model, the real-time variant of
pattern-preprocessing string matching has been extensively studied, leading to
very efficient solutions (see e.g. [3] and references therein). The indexing
variant, however, still has important unsolved questions.

* This work was done while this author was at Laboratoire d'Informatique Gaspard 
Monge, Universite Paris-Est & CNRS 



In the real-time indexing problem, we have to maintain an indexing
data structure for a text that arrives online, spending O(1) worst-case
time on each new character; a string matching query must be answered in
O(|P|) time for a query string P. Back in the 1970s, Slisenko [15] claimed a
solution to the real-time indexing problem, but its complex and voluminous
full description kept it from being acknowledged by the scientific community,
and the problem was considered open for many years. In 1994,
Kosaraju [11] presented another solution which, however, did not support
repeated matching queries on different portions of the arriving text, but
assumed that the text is read entirely before the matching query is made.
In 2008, Amir and Nor [2] proposed another algorithm that fixes this
drawback and allows queries to be made at any moment of the text scan.

All three existing real-time indexing solutions [15,11,2] support
only existential queries asking whether the pattern occurs in the text,
but are unable to report occurrences of the pattern. Designing a real-time
text indexing algorithm that would support queries on all occurrences of
a pattern is stated in [2] as the most important remaining open problem.
The algorithms of [11,2] assume a constant-size alphabet and are both
based on constructions of "incomplete" suffix trees, which can be built in
real time but can only answer existential queries. To output all occurrences of
a pattern, a fully-featured suffix tree is needed; however, real-time suffix
tree construction, first studied in [1], is in itself an open question. The
best currently known algorithm [4] spends O(log log n) worst-case time
on each character, and a truly real-time construction seems unlikely to
exist. Therefore, a suffix tree alone seems to be insufficient to solve the
real-time indexing problem.

In this paper, we propose the first real-time text indexing solution
that supports reporting all pattern occurrences, under the assumption
of a constant-size alphabet. The general idea is to maintain several data
structures, three in our case, each supporting queries for a different range
of pattern lengths. Our method employs the suffix tree construction technique
recently proposed by Kopelowitz [9]. Similarly to [9] and to previous real-time
indexing solutions [11,2], we assume that the text is read right-to-left;
otherwise, the pattern needs to be reversed before executing the query.
We use the word RAM model of computation; the same model is also used
in e.g. [4,9].

The paper is organized as follows. In Section 2, we describe auxiliary
data structures and Kopelowitz's technique, which are essential for our
algorithm. In Section 3, we describe the three data structures for different
pattern lengths that form the basis of our solution. These data
structures, however, do not provide a fully real-time algorithm. Then, in
Section 4, we show how to "fix" the solution of Section 3 in order to obtain
a fully real-time algorithm.

Throughout the paper, Σ is an alphabet of constant size σ. Since
the text T is read right-to-left, it will be convenient for us to enumerate
symbols of T from the end, i.e. T = t_n ... t_1, and the substring
t_i t_{i-1} ... t_j will be denoted T[i..j]. T[i..] denotes the suffix T[i..1].
Throughout this paper, we reserve k to denote the number of objects
(occurrences of a pattern, elements in a list, etc.) in the query answer.

2 Preliminaries 

In this section, we describe the main algorithmic tools used by our algorithms.

2.1 Range Reporting and Predecessor Queries on Colored Lists

We use data structures from [13] for searching in dynamic colored lists. 

Colored Range Reporting in a List. Let elements of a dynamic linked list C
be assigned positive integer values called colors. A colored range reporting
query on a list C consists of two integers col_1 ≤ col_2 and two pointers ptr_1
and ptr_2 that point to elements e_1 and e_2 of C. An answer to a colored
range reporting query consists of all elements e ∈ C occurring between e_1
and e_2 (including e_1 and e_2) such that col_1 ≤ col(e) ≤ col_2, where col(e)
is the color of e. The following result on colored range reporting has been
proved by Mortensen [13].
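
To make the query semantics concrete, here is a minimal Python sketch of colored range reporting by a plain linear scan. It only shows what the query returns and is not the data structure of [13]: modeling the list as a Python list of (element, color) pairs and the two pointers as indices are assumptions made for illustration.

```python
def colored_range_report(lst, i1, i2, col1, col2):
    """Return the elements between positions i1 and i2 (inclusive) whose
    color lies in [col1, col2]. This is a naive scan of the range; a
    dedicated structure would avoid touching non-matching elements."""
    return [elem for elem, col in lst[i1:i2 + 1] if col1 <= col <= col2]


# Hypothetical example list of (element name, color) pairs.
items = [("a", 3), ("b", 1), ("c", 4), ("d", 2), ("e", 3)]
result = colored_range_report(items, 1, 4, 2, 3)
```

Here the range covers b, c, d, e, and only d and e have a color in [2, 3].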

Lemma 1 ([13]). Suppose that col(e) ≤ log^f n for all e ∈ C and some
constant f ≤ 1/4. We can answer colored range reporting queries on C in
O(log log m + k) time using an O(m)-space data structure, where m is the
number of elements in C. Insertion of a new element into C is supported
in O(log log m) time.

Note that the bound f ≤ 1/4 follows from the description in [12]: the
data structure in [13] uses Q-heaps [6] to answer certain queries on the
set of colors in constant time.

Colored Predecessor Problem. The colored predecessor query on a list C
consists of an element e ∈ C and a color col. The answer to a query (e, col)
is the closest element e' ∈ C which precedes e such that col(e') = col. The
following lemma is also proved in [13]; we also refer to [8], where a similar
problem is solved.

Lemma 2 ([13]). Suppose that col(e) ≤ log^f n for all e ∈ C and some
constant f ≤ 1/4. There exists an O(m)-space data structure that answers
colored predecessor queries on C in O(log log m) time and supports
insertions in O(log log m) time, where m is the number of elements in C.
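
The colored predecessor query can likewise be illustrated by a naive backward scan. The sketch below is not the structure of Lemma 2; the list-of-pairs representation and the use of an index in place of a pointer are assumptions for illustration.

```python
def colored_predecessor(lst, i, col):
    """Return the index of the closest element strictly before position i
    whose color equals col, or None if no such element exists."""
    for j in range(i - 1, -1, -1):
        if lst[j][1] == col:
            return j
    return None


# Hypothetical example list of (element name, color) pairs.
items = [("a", 3), ("b", 1), ("c", 3), ("d", 2), ("e", 1)]
pred = colored_predecessor(items, 4, 3)
```

For the element e at index 4 and color 3, the scan skips d and stops at c.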

2.2 On-Line Indexing for Alphabets of Small Size 

Kopelowitz [9] described an online indexing method that works for an
arbitrarily large alphabet A. We describe below a simplified version of his
algorithm, adapted to our purposes, for the case when the alphabet size
|A| ≤ log^{1/4} n. For a current text T, Kopelowitz's algorithm maintains a
list S of its lexicographically sorted suffixes and a suffix tree 𝒯 (see
footnote 4). In addition, the following auxiliary data structures are used.
For any symbol a ∈ A that occurs in T at least once, we store a in a data
structure A. Since A contains at most log^{1/4} n elements, we can search in A
in O(1) time using Q-heaps [6]. For every a in A, we store a pointer last(a)
to the last (lexicographically largest) suffix of T that starts with a.
Furthermore, every suffix T[i..] in S is colored with color t_{i+1}, i.e., the
color of T[i..] is the symbol that precedes the starting position of T[i..] in T.
We maintain the structure D for colored predecessor queries on S, as in Lemma 2.
For each T[i..] in S, we also store a pointer to the suffix T[i+1..] in
S. Finally, we store a data structure for weighted level ancestor queries
on 𝒯: for any leaf v of 𝒯 and for any integer q, we can find the lowest
ancestor u of v such that the string depth of u is smaller than q. The
data structure from [10] uses linear space, supports dynamic tree updates
in expected O(log log n) time, and answers weighted level ancestor
queries in worst-case O(log log n) time.

The algorithm of Kopelowitz [9] consists of three phases. During the
first phase we prepend a symbol t_{n+1} to the current text T = t_n ... t_1 and
find the position of the new suffix t_{n+1}T in the lexicographically ordered
list of suffixes. To do this, we first check if t_{n+1} occurs at least once in
T. We query A for the largest symbol a ≤ t_{n+1}; if a ≠ t_{n+1}, then the
suffix t_{n+1}T should be inserted after last(a) into S. If a = t_{n+1}, at least
one suffix of T starts with t_{n+1}. Using D, we look for the predecessor
of T[n..] in S colored with t_{n+1}. Let T[j..] denote such a predecessor of
T[n..]. Then T[j+1..] starts with symbol t_{n+1}, and it is easy to check
that T[j+1..] is the lexicographically largest suffix that precedes t_{n+1}T.
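
The first phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm of [9]: the sorted list of suffixes is a plain Python list of positions counted from the end, the colored predecessor query is a naive backward scan, and the empty suffix (position 0) is kept in the list as a sentinel so that a suffix of length one is handled uniformly. The helper names `suffix`, `prepend`, and `build` are hypothetical.

```python
def suffix(text, i):
    """Suffix T[i..] = t_i ... t_1; positions are counted from the end."""
    return text[len(text) - i:]


def prepend(text, order, c):
    """Prepend symbol c and insert the new suffix into the sorted list.

    `order` holds suffix positions in lexicographic order and always
    contains the empty suffix 0 (a sentinel assumed by this sketch)."""
    n = len(text)
    color = lambda j: text[n - j - 1]       # symbol t_{j+1}, defined for j < n
    pos_n = order.index(n)
    pred = None
    for j in reversed(order[:pos_n]):       # naive colored predecessor of T[n..]
        if color(j) == c:
            pred = j
            break
    if pred is not None:
        at = order.index(pred + 1) + 1      # right after the suffix T[pred+1..]
    else:
        # No suffix starting with c is smaller than the new one, so it goes
        # right after all suffixes whose first symbol is smaller than c.
        at = sum(1 for j in order if j == 0 or suffix(text, j)[0] < c)
    return c + text, order[:at] + [n + 1] + order[at:]


def build(s):
    """Read s right-to-left, maintaining the sorted list of suffixes."""
    text, order = "", [0]
    for c in reversed(s):
        text, order = prepend(text, order, c)
    return text, order
```

The scans make each step linear in this sketch; the point is only to show why the colored predecessor of T[n..] determines where the new suffix belongs.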

When we know suffixes S' = T[i'..] and S'' = T[i''..] of T that respectively
precede and follow t_{n+1}T, we can find the longest common prefixes
ℓ' = lcp(t_{n+1}T, S') and ℓ'' = lcp(t_{n+1}T, S'') in O(1) time using the data
structure of [5]. If ℓ' > ℓ'', we find the leaf v of 𝒯 that holds S' and the
lowest ancestor u of v with string depth at most ℓ'; u is found using the
dynamic weighted level ancestor structure from [10]. If the string depth
of u equals ℓ', we create a new child of u that holds the new suffix t_{n+1}T.
Otherwise, let w be the child of u which is an ancestor of v. We split the
edge from u to w and create a new node u'. Then, we create a new child
of u' that holds t_{n+1}T. The case ℓ'' > ℓ' is symmetric. When the new suffix
is inserted into the suffix tree, we update all the auxiliary data structures
during the third phase. We refer to [9] for a detailed description of the
algorithm.

4 In subsequent sections we will consider S to be a part of 𝒯.

The only difference between the described procedure and the original
algorithm of Kopelowitz [9] is that our method assumes an alphabet of
size |A| ≤ log^{1/4} n, which allows us to employ the colored predecessor
data structure to search for the position of the suffix t_{n+1}T during the first
phase. The algorithm in [9] works for an arbitrarily large alphabet, and
therefore requires more complicated data structures.

3 Fast Off-Line Solution 

In this section we describe the main part of our algorithm. The algorithm 
updates the index by reading the text in the right-to-left order. However, 
the algorithm we describe now will not be on-line, as it will have to access 
symbols to the left of the currently processed symbol. Another "flaw" of 
the algorithm is that it will support pattern matching queries only with 
an additional exception: we will be able to report all occurrences of a 
pattern except for those with start positions among a small number of 
most recently processed symbols of T. In the next section we will show how 
to fix these issues and turn our algorithm into a fully real-time indexing 
solution that reports all occurrences of a pattern. 

The algorithm distinguishes between three types of query patterns depending
on their length: long patterns contain at least (log log n)^2 symbols,
medium-size patterns contain between (log^(3) n)^2 and (log log n)^2 symbols,
and short patterns contain fewer than (log^(3) n)^2 symbols (see footnote 5).
For each of the three types of patterns, the algorithm will maintain a separate
data structure supporting queries in O(|P| + k) time for matching patterns of
the corresponding type.

5 Henceforth, log^(3) n = log log log n.
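
As a small numeric illustration of the three length classes, the following Python sketch computes the two thresholds and classifies a pattern length m. The base of the logarithm is not fixed in the text, so base 2 is an assumption here, and the function names are hypothetical.

```python
import math


def thresholds(n):
    """Return ((log^(3) n)^2, (log log n)^2) using base-2 logarithms
    (the base is an assumption of this sketch). Requires n > 4."""
    loglog = math.log2(math.log2(n))
    log3 = math.log2(loglog)
    return log3 ** 2, loglog ** 2


def pattern_class(m, n):
    """Classify a pattern of length m against a text of length n."""
    short_bound, long_bound = thresholds(n)
    if m >= long_bound:
        return "long"
    if m >= short_bound:
        return "medium"
    return "short"


n = 2 ** 64   # log log n = 6, so the long threshold is 36
```

For this n the medium threshold is (log2 6)^2, roughly 6.7, so patterns of length 7 to 35 are medium-size.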
3.1 Long Patterns

To match long patterns, we maintain a sparse suffix tree 𝒯_L storing only
suffixes that start at positions q·d for q ≥ 1 and d = log log n / (4 log σ).
Suffixes stored in 𝒯_L are regarded as strings over a meta-alphabet of
size σ^d = log^{1/4} n. This allows us to use the method of Section 2.2 to
maintain 𝒯_L. Recall that the method maintains a list of sorted suffixes
that we denote C_L.

Using 𝒯_L we can find occurrences of a pattern P that start at positions
qd for q ≥ 1, but not occurrences starting at positions qd+δ for 1 ≤ δ < d.
To be able to find all occurrences, we maintain an additional list C_E
defined as follows.

The list C_E contains copies of all nodes of 𝒯_L as they occur during the
Euler tour of 𝒯_L. Thus, C_E contains one element for each leaf and two
elements for each internal node of 𝒯_L. The first copy of an internal node
u precedes the copies of all nodes in the subtree of u, and the second
copy of u occurs immediately after the copies of all descendants of u.
To simplify the presentation, we will not distinguish between elements
of C_E and the suffix tree nodes that they represent. If a node of C_E is a
leaf that corresponds to a suffix T[i..], we mark it with the meta-symbol
T[i,d] = t_{i+1} t_{i+2} ... t_{i+d}, which is interpreted as the color of the leaf for
the suffix T[i..]. Colors are ordered by lexicographic order of the underlying
strings. If S = s_1 ... s_j is a string with j < d, then S defines an interval of
colors, denoted [minc(S), maxc(S)], corresponding to all strings of length
d with prefix S. Recall that there are log^{1/4} n different colors. On the list
C_E, we maintain the data structure of Lemma 1 for colored range reporting
queries.
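
The Euler tour list described above can be sketched as follows. This is a minimal illustration on a static tree: the dictionary `children` (mapping each internal node to its ordered children, with leaves absent from the keys) and both function names are assumptions, and the dynamic maintenance of the list is not shown.

```python
def euler_tour_list(children, root):
    """List with one copy of each leaf and two copies of each internal
    node, in the order of an Euler tour starting at root."""
    out = []

    def visit(v):
        out.append(v)                 # first (or only) copy of v
        kids = children.get(v, [])
        for c in kids:
            visit(c)
        if kids:
            out.append(v)             # second copy, internal nodes only

    visit(root)
    return out


def leaves_in_subtree(tour, children, u):
    """Leaves between the first and the last copy of u in the tour."""
    first = tour.index(u)
    last = len(tour) - 1 - tour[::-1].index(u)
    return [v for v in tour[first:last + 1] if v not in children]


# Hypothetical tree: r has children u and c; u has children a and b.
children = {"r": ["u", "c"], "u": ["a", "b"]}
tour = euler_tour_list(children, "r")
```

The leaves of the subtree of u are exactly the leaves lying between the two copies of u, which is what lets a range query on the list stand in for a subtree query.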

After reading character t_i where i = qd for q ≥ 1, we add the suffix
T[i..], viewed as a string over the meta-alphabet of cardinality log^{1/4} n,
to 𝒯_L according to the algorithm described in Section 2.2. In addition,
we have to update the list C_E, i.e. to insert into C_E the new leaf holding
the suffix T[i..] marked with the color t_{i+1} t_{i+2} ... t_{i+d}. (Note that here
the algorithm "looks ahead" at the forthcoming d letters of T.) If a
new internal node has been inserted into 𝒯_L, we also update the list C_L
accordingly. (Details are left out and can be found e.g. in [12].)

Since the meta-alphabet size is only log^{1/4} n, the navigation in 𝒯_L
from a node to a child can be supported in O(1) time. Observe that
the children of any internal node v ∈ 𝒯_L are naturally ordered by the
lexicographic order of edge labels. We store the children of v in a data
structure D_v which allows us to find in O(1) time the child whose edge
label starts with a string (meta-symbol) S = s_1 ... s_d. Moreover, we can
also compute in O(1) time the "smallest" and the "largest" child of v
whose edge label starts with a string S = s_1 ... s_j with j < d. D_v
also supports adding a new edge in O(1) time. The data structure D_v can
be implemented using e.g. atomic heaps [6]; since all elements in D_v are
bounded by log^{1/4} n, we can also implement D_v as described in [14].

We now consider a long query pattern P = p_1 ... p_m and show how
the occurrences of P are computed. An occurrence of P is said to be
a δ-occurrence if it starts in T at a position j = qd + δ, for some q.
For each δ, 0 ≤ δ ≤ d - 1, we find all δ-occurrences as follows. First
we "spell out" P_δ = p_{δ+1} ... p_m in 𝒯_L over the meta-alphabet, i.e. we
traverse 𝒯_L proceeding by blocks of up to d letters of Σ. If this process
fails at some step, then P has no δ-occurrences. Otherwise, we spell out
P_δ completely, and retrieve the closest explicit descendant node v_δ, or a
range of descendant nodes v_δ^l, v_δ^{l+1}, ..., v_δ^r in the case when P_δ spells
to an explicit node except for a suffix of length less than d. The whole
spelling step takes time O(|P|/d + 1).

Now we jump to the list C_E and retrieve the first occurrence of v_δ
(or v_δ^l) and the second occurrence of v_δ (or v_δ^r) in C_E. A leaf u of 𝒯_L
corresponds to a δ-occurrence of P if and only if u occurs in the subtree
of v_δ (or the subtrees of v_δ^l, ..., v_δ^r) and the color of u belongs to
[minc(p_δ ... p_1), maxc(p_δ ... p_1)]. In the list C_E, these leaves occur
precisely within the interval we computed. Therefore, all δ-occurrences of P
can be retrieved in time O(log log n + k_δ) by a colored range reporting
query (Lemma 1), where k_δ is the number of δ-occurrences. Summing up
over all δ, all occurrences of a long pattern P can be reported in time
O(d(|P|/d + log log n) + k) = O(|P| + d log log n + k) = O(|P| + k), as
d = log log n / (4 log σ), σ = O(1) and |P| ≥ (log log n)^2.
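
The decomposition into δ-occurrences can be illustrated with a much simpler stand-in for the sparse suffix tree. The sketch below is an assumption-laden illustration, not the construction above: it uses ordinary left-to-right, 0-based positions (the mirror image of the right-to-left numbering used in this paper), a sorted array of sampled suffixes searched with `bisect` in place of the tree, and a direct comparison of the δ symbols preceding the sampled position in place of the colored range query. The function names are hypothetical.

```python
from bisect import bisect_left


def build_sparse_index(text, d):
    """Sorted (suffix, start) pairs for starts that are multiples of d."""
    return sorted((text[s:], s) for s in range(0, len(text), d))


def find_occurrences(text, index, d, pattern):
    """Report all occurrences of pattern (with len(pattern) >= d) using
    only suffixes that start at multiples of d."""
    assert len(pattern) >= d
    hits = []
    for delta in range(d):
        head, tail = pattern[:delta], pattern[delta:]
        # Sampled suffixes having tail as a prefix are contiguous in the
        # sorted order and begin at the first entry not smaller than tail.
        k = bisect_left(index, (tail,))
        while k < len(index) and index[k][0].startswith(tail):
            s = index[k][1]
            # "Color" check: the delta symbols before s must spell head.
            if s >= delta and text[s - delta:s] == head:
                hits.append(s - delta)
            k += 1
    return sorted(hits)


text = "abracadabra_abracadabra"
index = build_sparse_index(text, 3)
found = find_occurrences(text, index, 3, "abra")
```

Each occurrence is found for exactly one value of δ, namely its distance to the next sampled position, so no occurrence is reported twice.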

3.2 Medium-Size Patterns 

Now we show how to answer matching queries for patterns P where
(log^(3) n)^2 ≤ |P| < (log log n)^2. In a nutshell, we apply the same method
as in Section 3.1 with the main difference that the sparse suffix tree will
store only truncated suffixes of length (log log n)^2, i.e. prefixes of suffixes
bounded by (log log n)^2 symbols. We store truncated suffixes starting at
positions spaced by log^(3) n = log log log n symbols. The total number
of different truncated suffixes is at most σ^{(log log n)^2}. This small number
of suffixes will allow us to search and update the data structures faster
compared to Section 3.1. We now describe the details of the construction.

We store all truncated suffixes that start at positions qd', for q ≥ 1
and d' = log^(3) n, in a tree 𝒯_M. 𝒯_M is organized in the same way as
the standard suffix tree; that is, 𝒯_M is a compressed trie for substrings
T[qd'..qd' - (log log n)^2 + 1], where these substrings are regarded as strings
over the meta-alphabet Σ^{d'} (see footnote 6). Observe that the same truncated
suffix can occur several times. Therefore, we augment each leaf v with a list
of colors Col(v) corresponding to left contexts of the corresponding truncated
suffix S. More precisely, if S = T[qd'..qd' - (log log n)^2 + 1] for some q ≥ 1,
then T[qd',d'] is added to Col(v). Note that the number of colors is bounded
by σ^{log^(3) n}. Furthermore, for each color col in Col(v), we store all positions
i = qd' of T such that S occurs at i and T[i,d'] = col. As in Section 3.1, we
store a list C_M that contains colored elements corresponding to the Euler
tour traversal of 𝒯_M. For each internal node, C_M contains two elements.
For every leaf v and for each value col in its color list Col(v), C_M contains
a separate element colored with col. Observe that since the size of C_M
is bounded by O(σ^{(log log n)^2 + log^(3) n}), updates of C_M can be supported in
O(log log(σ^{(log log n)^2 + log^(3) n})) = O(log^(3) n) time, and colored reporting
queries on C_M can be answered in O(log^(3) n + k) time (see Lemma 1).

Truncated suffixes are added to 𝒯_M using a method similar to that
of Section 3.1. After reading a symbol t_{qd'} for some q ≥ 1, we add
S_new = T[qd'..qd' - (log log n)^2 + 1] colored with T[qd',d'] to the tree
𝒯_M. To find the place of S_new in the list of leaves of 𝒯_M, here we compare
truncated suffixes directly instead of using colored predecessor queries as in
Kopelowitz's algorithm (Section 2.2). Observe that every truncated suffix
can be viewed as an integer in the range [1..U] for U = σ^{(log log n)^2}. We
store the current truncated suffixes in a van Emde Boas data structure V.
Using V, we can find the largest S_prev < S_new and the smallest S_next >
S_new in 𝒯_M in O(log log U) = O(log^(3) n) time. Let ℓ' = lcp(S_prev, S_new),
ℓ'' = lcp(S_next, S_new), and ℓ = max(ℓ', ℓ''). Observe that lcp values can be
computed in O(1) time using standard bit operations. Once ℓ is known,
we update the tree 𝒯_M, spending O(log log |𝒯_M|) = O(log^(3) n) expected
time on the weighted level ancestor query. Finally, we update C_M: if C_M
already contains a leaf with string value S_new and color T[qd',d'], we add
qd' to the list of its occurrences; otherwise, we insert a new element into
C_M and initialize its location list to qd'. Altogether, the addition of a new
truncated suffix S_new requires O(log^(3) n) time.
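
The remark that lcp values of truncated suffixes can be computed with bit operations can be made concrete as follows. This is a sketch under assumptions: each symbol is encoded with a fixed number of bits, both strings have the same length, and Python's arbitrary-precision integers with `int.bit_length` stand in for word-level operations. The function names are hypothetical.

```python
def encode(s, alphabet):
    """Pack string s into an integer, first symbol in the highest bits.
    Returns (code, bits_per_symbol)."""
    bits = max(1, (len(alphabet) - 1).bit_length())
    code = 0
    for ch in s:
        code = (code << bits) | alphabet.index(ch)
    return code, bits


def lcp(x, y, length, bits):
    """Longest common prefix, in symbols, of two encoded strings of the
    same length: locate the highest differing bit of x XOR y."""
    diff = x ^ y
    if diff == 0:
        return length
    equal_leading_bits = length * bits - diff.bit_length()
    return equal_leading_bits // bits


alphabet = "acgt"
x, bits = encode("acgt", alphabet)
y, _ = encode("acga", alphabet)
common = lcp(x, y, 4, bits)
```

Dividing the number of equal leading bits by the symbol width rounds down to whole symbols, so a difference inside a symbol's bit field does not count that symbol as matching.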

A query for a pattern P = p_1 ... p_m, such that (log^(3) n)^2 ≤ m <
(log log n)^2, is answered in the same way as in Section 3.1. For each ρ =
0, ..., log^(3) n - 1, we find locus nodes v_ρ^l, ..., v_ρ^r (possibly with
v_ρ^l = v_ρ^r) of P_ρ = p_{ρ+1} ... p_m. Then, we find all elements in C_M occurring
between the first occurrence of v_ρ^l and the second occurrence of v_ρ^r and
colored with a color col that belongs to [minc(p_ρ ... p_1), maxc(p_ρ ... p_1)].
For every such element, we traverse the associated list of occurrences: if a
position i is in the list, then P occurs at position (i + ρ). The total time
needed to find all occurrences of a medium-size pattern P is
O(d'(|P|/d' + log^(3) n) + k) = O(|P| + (log^(3) n)^2 + k) = O(|P| + k),
since |P| ≥ (log^(3) n)^2.

6 For simplicity we assume that log^(3) n and log log n are integers and log^(3) n divides
log log n. If this is not the case, we can find d' and d that satisfy these requirements
such that log log n ≤ d ≤ 2 log log n and log^(3) n ≤ d' ≤ 2 log^(3) n.

3.3 Short Patterns 

Finally, we describe our indexing data structure for patterns P with |P| <
(log^(3) n)^2. We maintain the tree 𝒯_S of truncated suffixes of length Δ =
(log^(3) n)^2 seen so far in the text. For every position i of T, 𝒯_S contains the
substring T[i..i - Δ + 1]. 𝒯_S is organized as a compacted trie. We support
queries and updates on 𝒯_S using tabulation. There are O(2^{σ^Δ}) different
trees, and O(σ^Δ) different queries can be made on each tree. Therefore, we
can afford to explicitly store all possible trees 𝒯_S and to tabulate possible
tree updates. Each internal node of a tree stores pointers to its leftmost
and rightmost leaves, the leaves of a tree are organized in a list, and each
leaf stores the encoding of the corresponding string Q.

The update table T_u stores, for each tree 𝒯_S and for any string Q,
|Q| = Δ, a pointer to the tree 𝒯'_S (possibly the same) obtained after
adding Q to 𝒯_S. Table T_u uses O(2^{σ^Δ} σ^Δ) = o(n) space. The output table
T_o stores, for every string Q of length Δ, the list of positions in the
current text T where Q occurs. T_o has σ^Δ = o(n) entries and all lists of
occurrences take O(n) space altogether.

When scanning the text, we maintain the encoding of the string Q of
the Δ most recently read symbols of T. The encoding is updated after each
symbol using bit operations. After reading a new symbol, the current
tree 𝒯_S is updated using table T_u and the current position is added to
the entry T_o[Q]. Updates take O(1) time.
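
The rolling encoding of the most recently read symbols and the output table can be sketched as below. This is an illustration under assumptions: a Python dictionary plays the role of the output table, the tabulated trees are omitted, positions are counted from the end of the text as elsewhere in this paper, and windows are recorded only once a full window of symbols has been read. The function names and the parameter `delta` (the window length) are hypothetical.

```python
from collections import defaultdict


def encode_window(s, rank, bits):
    """Encode a string with its first symbol in the highest bits."""
    code = 0
    for ch in s:
        code = (code << bits) | rank[ch]
    return code


def scan_right_to_left(text, alphabet, delta):
    """Read text right-to-left and record, for every full window
    t_i ... t_{i-delta+1}, the position i under the window's encoding."""
    bits = max(1, (len(alphabet) - 1).bit_length())
    rank = {ch: r for r, ch in enumerate(alphabet)}
    top_shift = bits * (delta - 1)
    code, table, n = 0, defaultdict(list), len(text)
    for i in range(1, n + 1):
        c = text[n - i]                                  # symbol t_i
        # Drop the oldest symbol (lowest bits), put the newest on top.
        code = (rank[c] << top_shift) | (code >> bits)
        if i >= delta:
            table[code].append(i)
    return table, rank, bits


table, rank, bits = scan_right_to_left("abracadabra", "abcdr", 3)
positions = table[encode_window("abr", rank, bits)]
```

The window "abr" is seen at positions 4 and 11 counted from the end, which correspond to the two occurrences of "abr" in the text.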

To answer a query P, |P| < Δ, we find the locus u of P in the current
tree 𝒯_S, retrieve its leftmost and rightmost leaves, and traverse the leaves
in the subtree of u. For each traversed leaf v_l with label Q, we report the
occurrences stored in T_o[Q]. The query takes time O(|P| + k).

4 Real-Time Indexing 

The indexes for long and medium-size patterns, described in Sections 3.1
and 3.2 respectively, do not provide real-time indexing solutions for several
reasons. The index for long patterns, for example, requires looking ahead
at the forthcoming d symbols when processing symbols t_i for i = qd,
q ≥ 1. Furthermore, for such i, we are unable to find occurrences of
query patterns P starting at positions t_{i-1} ... t_{i-d+1} before processing t_i.
A similar situation holds for medium-size patterns. Another issue is that
in our previous development we assumed the length n of T to be known,
whereas this may of course not be the case in the real-time setting. In
this section, we show how to fix these issues in order to make the indexes
real-time. First, we show how the data structures of Sections 3.1 and 3.2
can be updated in real-time mode. Then, we describe how to search for
patterns that start among the most recently processed symbols. We describe
our solutions to these issues for the case of long patterns, as a simple
change of parameters provides a solution for medium-size patterns too.
Finally, we show how to circumvent the fact that the length of
T is not known in advance.

In the algorithm of Section 3.1, the text is partitioned into blocks of
length d, and the insertion of a new suffix T[i..] is triggered only when the
leftmost symbol t_i of a block is reached. The insertion takes time O(d)
and assumes the knowledge of the forthcoming block t_{i+d} ... t_{i+1}. To turn
this algorithm real-time, we apply a standard deamortization technique.
We distribute the cost of the insertion of suffix T[i - d..] over the d symbols
of the block t_{i+d} ... t_{i+1}. This is correct, as by the time we start reading
the block t_{i+d} ... t_{i+1}, we have read the block t_i ... t_{i-d+1} and therefore
have all necessary information to insert suffix T[i - d..]. In this way, we
spend O(1) expected time per symbol to update all involved data structures.
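
The deamortization schedule can be simulated with a small Python sketch. It is only a model of the bookkeeping under stated assumptions: the O(d) insertion is abstracted as d unit steps, one of which is executed per newly read symbol, and the class and attribute names are hypothetical.

```python
class DeamortizedInserter:
    """Spreads each d-unit suffix insertion over the next d symbols."""

    def __init__(self, d):
        self.d = d
        self.read = 0          # number of symbols read so far
        self.pending = None    # suffix position being inserted, if any
        self.remaining = 0     # work units left for the pending insertion
        self.indexed = []      # suffix positions fully inserted so far

    def read_symbol(self):
        """Process one symbol; return the insertion work units spent."""
        self.read += 1
        units = 0
        if self.pending is not None:
            self.remaining -= 1
            units = 1
            if self.remaining == 0:
                self.indexed.append(self.pending)
                self.pending = None
        if self.read % self.d == 0 and self.read >= 2 * self.d:
            # Block boundary i = read: the block following suffix i - d
            # is now fully known, so its insertion can start.
            self.pending = self.read - self.d
            self.remaining = self.d
        return units


ins = DeamortizedInserter(4)
work = [ins.read_symbol() for _ in range(20)]
```

With d = 4 and 20 symbols read, the suffixes at positions 4, 8 and 12 are fully inserted, the insertion of the suffix at position 16 has just been scheduled, and no symbol costs more than one unit.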

Now assume we are reading a block t_{i+d} ... t_{i+1}, i.e. we are processing
some symbol t_{i+δ} for 1 ≤ δ < d. At this point, we are unable to find
occurrences of a query pattern P starting at t_{i+δ} ... t_{i+1} as well as within
the two previous blocks, as they have not been indexed yet. This concerns up
to (3d - 1) most recent symbols. We therefore introduce a separate procedure
to search for occurrences that start in the 3d leftmost positions of the already
processed text. This can be done by simply storing T in a compact form
T_c where every log_σ n consecutive symbols are packed into one computer
word (see footnote 7). Thus, T_c uses O(|T| / log_σ n) words of space. Using T_c,
we can test whether T[j..j - |P| + 1] = P, for any pattern P and any position j,
in O(⌈|P| / log_σ n⌉) = o(|P|/d) + O(1) time. Therefore, checking 3d positions
takes time o(|P|) + O(d) = O(|P|) for a long pattern P.

7 In fact, it would suffice to store the 3d - 1 most recently read symbols in compact form.

We now describe how we can apply our algorithm in the case when
the text length is not known beforehand. In this case, we assume |T| to
take increasing values n_0 < n_1 < ..., as long as the text T keeps growing.
Here, n_0 is some appropriate initial value and n_i = 2n_{i-1} for i ≥ 1.

Suppose now that n_i is the currently assumed value of |T|. After we
reach character t_{n_i/2}, during the processing of the next n_i/2 symbols,
we keep building the index for |T| = n_i and, in parallel, rebuild all the
data structures under the assumption that |T| = n_{i+1} = 2n_i. In particular,
if log log(2n_i) ≠ log log n_i, we build a new index for long patterns, and if
log^(3)(2n_i) ≠ log^(3) n_i, we build a new index for medium-size and short
patterns. If log_σ(2n_i) ≠ log_σ n_i, we also construct a new compact
representation T_c introduced earlier in this section. Altogether, we distribute
the construction cost of the data structures for T[n_i..1] under the assumption
|T| = 2n_i over the processing of t_{n_i/2+1} ... t_{n_i}. Since O(n_i) = O(n_i/2),
processing these n_i/2 symbols remains real-time. By the time t_{n_i} has been
read, all data structures for |T| = 2n_i have been built, and the algorithm
proceeds with the new value |T| = n_{i+1}. Observe finally that the intervals
[n_i/2 + 1, n_i] are all disjoint, therefore the overhead per letter incurred
by the procedure remains constant. In conclusion, the whole algorithm
remains real-time. We finish with our main result.

Theorem 1. There exists a data structure storing a text T that can be
updated in O(1) worst-case expected time after prepending a new symbol
to T. This data structure reports all occurrences of a pattern P in the
current text T in O(|P| + k) time, where k is the number of occurrences.
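
The doubling schedule used in Section 4 to handle an unknown text length can be simulated with the following sketch. It models only the accounting, under assumptions: rebuilding the structures for the doubled length is abstracted as re-inserting each already read symbol at unit cost, two such units are performed per new symbol during the second half of the current interval, and the initial value n0 is assumed to be even. The function name is hypothetical.

```python
def simulate_doubling(total, n0):
    """Return (lengths at which the index was switched,
    maximum rebuild units spent on any single symbol)."""
    assumed = n0        # currently assumed text length n_i
    rebuilt = 0         # symbols already inserted into the next index
    max_units = 0
    switches = []
    for read in range(1, total + 1):
        units = 0
        if read > assumed // 2:
            # Background rebuild for length 2 * assumed:
            # at most two re-insertions per newly read symbol.
            while units < 2 and rebuilt < read:
                rebuilt += 1
                units += 1
        max_units = max(max_units, units)
        if read == assumed:
            assert rebuilt == assumed   # the next index is complete in time
            switches.append(assumed)
            assumed *= 2
            rebuilt = 0
    return switches, max_units


switches, max_units = simulate_doubling(64, 8)
```

Because the rebuild for length 2n_i covers n_i symbols and is spread over n_i/2 newly read symbols, two units per symbol suffice in this model.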

5 Conclusions 

In this paper we presented the first real-time indexing data structure that
supports reporting all pattern occurrences in optimal time O(|P| + k). As
in the previous works on this topic [11,2,4], we assume that the input
text is over an alphabet of constant size. It may be possible to extend our
result to alphabets of poly-logarithmic size.

Our algorithm spends constant expected worst-case time on updating
the data structure when a new text symbol arrives. The expectation comes
only from the updates of the weighted level ancestor structure [10], which,
in turn, comes from the updates for the dynamic predecessor problem
(y-fast tries). We believe that one can get rid of the expectation; however, we
have not found a solution to this so far.